City Jobs Dataset

We start by importing the data for NYC jobs from Github and cleaning up the column names. Then we remove duplicate listings.

raw_nyc_df <- read.csv('https://raw.githubusercontent.com/mehtablocker/cuny_607/master/project_3/nyc-jobs.csv')
nyc_jobs_df <- raw_nyc_df
names(nyc_jobs_df) <- names(nyc_jobs_df) %>% tolower() %>% gsub("\\.", "_", .)
names(nyc_jobs_df)[names(nyc_jobs_df)=="x__of_positions"] <- "n_of_positions"
nyc_jobs_df <- nyc_jobs_df %>% select(-posting_type) %>% unique()
nyc_jobs_df %>% tail() %>% datatable()


Filter for data-specific jobs

We filter for data science jobs by using a regular expression to search the business_title column for the case insensitive terms “data” or “analytics.” Then we create another table for non-data jobs.

data_jobs_df <- nyc_jobs_df %>% filter(grepl("data|analytics", business_title, ignore.case = T))
## Warning: package 'bindrcpp' was built under R version 3.4.4
other_jobs_df <- nyc_jobs_df %>% filter(!grepl("data|analytics", business_title, ignore.case = T))
data_jobs_df %>% head() %>% datatable()


Analyze quantity and salary

We can see from the above table that a lot of key values are missing, including Job Description and Preferred Skills. This significantly limits our analysis capabilities to only a few areas.


Of all the jobs working for New York City, how many are data jobs?

### Total number of jobs in the dataset:
nrow(nyc_jobs_df)
## [1] 2205
### Number of data jobs:
nrow(data_jobs_df)
## [1] 71
### Data jobs, as a percentage of total:
nrow(data_jobs_df)/nrow(nyc_jobs_df)
## [1] 0.03219955

In this dataset, only about 3.2 percent of jobs are data jobs.


In terms of the high range of salary, how well do data jobs pay relative to non-data jobs?

### Data jobs
summary(data_jobs_df$salary_range_to)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      15   65339   84301   78740   99000  161497
### Non-data jobs
summary(other_jobs_df$salary_range_to)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     10.36  56703.25  75595.50  78639.47 101673.00 230000.00
par(mfrow=c(1,2))
boxplot(data_jobs_df$salary_range_to, xlab="Data Jobs", ylab="Salary in Dollars", ylim=c(0, 200000))
boxplot(other_jobs_df$salary_range_to, xlab="Non-Data Jobs", ylab="Salary in Dollars", ylim=c(0, 200000))

par(mfrow=c(1,1))

The distribution is wider for non-data jobs, but the median salary is higher for data jobs. It is important to remember that these are all government jobs, which overall may pay less than private sector jobs.


Technology Jobs Dataset

Next we import the data for technology jobs within New York City that were posted to dice.com. We separate one of the columns and rename a few others.

raw_dice_df <- read.csv('https://raw.githubusercontent.com/mehtablocker/cuny_607/master/project_3/dice_com_nyc_jobs.csv', stringsAsFactors = F)
dice_jobs_df <- as_tibble(raw_dice_df) %>% 
  separate(employmenttype_jobstatus, into=c("employment_type", "job_status"), sep = ", ", fill="right", extra = "drop")
dice_jobs_df <- dice_jobs_df %>% 
  rename(advertiser_url = advertiserurl, 
         job_description = jobdescription,
         job_id = jobid, 
         job_location = joblocation_address, 
         job_title = jobtitle, 
         post_date = postdate)
dice_jobs_df %>% head() %>% datatable()


Filter for data science jobs

Since this dataset is comprised of only technology jobs, finding specifically data science jobs may require a bit more nuance. For example, if we try to filter for the words “data” or “analytics” as before, we catch a lot of software developer jobs that are not exactly the same subspace as data science.

ds_dice_df <- dice_jobs_df %>% filter(grepl("data|analytics", job_title, ignore.case = T))
ds_dice_df %>% select(job_title, company, employment_type, skills) %>% head() %>% datatable()

We can refine our search by excluding words like “engineer” and “architect” to get a more relevant result.

ds_dice_df <- ds_dice_df %>% 
  filter(!grepl("architect|architecture|engineer|developer|development|administrator|administration", job_title, ignore.case = T))
ds_dice_df %>% select(job_title, company, employment_type, skills) %>% head() %>% datatable()


Search for keywords

We can text mine the job_description and skills columns to find specific keywords.


How many job postings mention the R programming language?

r_dice_df <- ds_dice_df %>% 
  filter(grepl(" R | R,", job_description, ignore.case=T) | grepl(" R | R,", skills, ignore.case=T))
nrow(r_dice_df)
## [1] 6
r_dice_df %>% head() %>% datatable()

Of our 59 filtered job listings, six explicitly mention R.


How many job postings mention Python?

python_dice_df <- ds_dice_df %>% 
  filter(grepl(" python | python,", job_description, ignore.case=T) | grepl(" python | python,", skills, ignore.case=T))
nrow(python_dice_df)
## [1] 8
python_dice_df %>% head() %>% datatable()

Of our 59 filtered job listings, eight explicitly mention Python.


Indeed Jobs Scraping

Web Scraping Indeed Job Search Pages

To obtain the latest available information about data science job postings, 50 pages of the latest Indeed job postings are scraped using various html nodes in the rvest package. The resulting dataset contains the following fields: job title, company, location, job summary, and link. To identify the correct html nodes that return these fields, the chrome extention SelectorGadget was used in conjunction with inspect element. Regular expressions are also used for each field to remove blank spaces, new lines, and unnecessary information.

listings <- data.frame(title=character(),
                 company=character(), 
                 location=character(), 
                 summary=character(), 
                 link=character(), 
                 description = character(),
                 stringsAsFactors=FALSE) 

for (i in seq(0, 990, 10)){
  url_ds <- paste0('https://www.indeed.com/jobs?q=data+scientist&l=all&start=',i)
  var <- read_html(url_ds)
  
  #job title
  title <-  var %>% 
    html_nodes('#resultsCol .jobtitle') %>%
    html_text() %>%
    str_extract("(\\w+.+)+") 
  
  #company
  company <- var %>% 
    html_nodes('#resultsCol .company') %>%
    html_text() %>%
    str_extract("(\\w+).+") 
  
  #location
  location <- var %>%
    html_nodes('#resultsCol .location') %>%
    html_text() %>%
    str_extract("(\\w+.)+,.[A-Z]{2}")   

  #summary
  summary <- var %>%
    html_nodes('#resultsCol .summary') %>%
    html_text() %>%
    str_extract(".+")
  
  #link
  link <- var %>%
    html_nodes('#resultsCol .jobtitle .turnstileLink, #resultsCol a.jobtitle') %>%
    html_attr('href') 
  link <- paste0("https://www.indeed.com",link)

    
  listings <- rbind(listings, as.data.frame(cbind(title,
                                                  company,
                                                  location,
                                                  summary,
                                                  link)))
}
## Warning in cbind(title, company, location, summary, link): number of rows
## of result is not a multiple of vector length (arg 2)

We have successfully scraped job title, company, location, job summary, and job link from 100 pages of Indeed job postings.

datatable(listings)

Creating a Unique ID for Each Job Listing

We create a unique ID for each listing based on title, location, company, summary, and description. This unique ID will be used to remove duplicates (since many sponsored posts show up multiple times in Indeed’s postings).

#create a unique id for each job posting attribute combination
listings$uniqueid <- mapply(function(x, y, z) digest(paste0(x,y,z)), listings$title, listings$location, listings$company)

#remove duplicate unique ids
listings %<>%
  distinct(uniqueid, .keep_all = TRUE)

#remove duplicate links
listings %<>%
  distinct(link, .keep_all = TRUE)
datatable(listings)

Scraping Individual Listings for Full Descriptions

The summaries provided in the main job listing pages are somewhat limited. Links to each individual job posting were also extracted in the previous step, so it is possible to iterate theough each link and scrape the full job description. To do this, duplicates are removed and rvest is used again to extract full descriptions.

#obtain full description for all job postings
for (i in (1:length(listings$link))){
  description <- tryCatch(
     html_text(html_node(read_html(as.character(listings$link[i])),'.jobsearch-JobComponent-description')),
     error=function(e){NA}
  )
  if (is.null(description)){
    desc <- NA
  }
  listings$description[i] <- description
}
datatable(listings)
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html

Store Job Listings in AWS SQL Server

Now that full job descriptions are available, regular expressions are used for some minor cleanup. Finally, we store the information from the web scraping in an SQL server. Rather than deleting the SQL table everytime this code chunk is run, we append the new listing to the existing SQL server. This allows any additional job listings to be captured without deleting older job listings. To prevent capturing duplicates in the SQL database, we use uniqueid as the primary key and only append listings with distinct unique IDs.

config <- config::get()

drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, user=config$user, password=config$password, dbname=config$dbname, host= config$host)

#Query to create new SQL table using uniqueid as the primary key
sqltable <- "CREATE TABLE listings (
 title TEXT,
 company TEXT,
 location TEXT,
 summary TEXT,
 link TEXT,
 description TEXT,
 uniqueid TEXT PRIMARY KEY
)"

#Creates new SQL table "listings" using the query above if it does not exist yet 
if (!dbExistsTable(con, "listings")){
  dbExecute(con, sqltable)
}

#Instead of dropping the table and creating a new one, we append records. Setting uniqueID as the primary key prevents us from appending duplicate job postings to the SQL database. This requires us to append a subset of job listings that are not duplicates. To do this, we import all uniqueid values from SQL and subset listings to exclude duplicates - then import into SQL.
dbWriteTable(con, "dup_listings", listings, row.names = FALSE)
## [1] TRUE
dupcheck <- "
SELECT
  uniqueid,
  MAX(title) as title,
  MAX(company) as company,
  MAX(location) as location,
  MAX(summary) as summary,
  MAX(link) as link,
  MAX(description) as description
FROM dup_listings
WHERE uniqueid NOT IN (SELECT uniqueid FROM listings)
GROUP BY uniqueid;
"
dupcheck_results <- dbGetQuery(con, dupcheck)

if (nrow(dupcheck_results) > 0) {
  dbWriteTable(con, "listings", dupcheck_results, append = TRUE, row.names = FALSE)
}
## [1] TRUE
dbRemoveTable(con, "dup_listings")
## [1] TRUE
dbDisconnect(con)
## [1] TRUE

Import Job Listings from SQL Server

This step imports all the unique data science job listings that we have saved in a SQL table.

drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, user=config$user, password=config$password, dbname=config$dbname, host= config$host)

#Query to get all job listings from SQL database.
sql <- " 
    SELECT uniqueid, title, company, location, summary, link, description
    FROM listings
"

#Show query results
results <- dbGetQuery(con, sql)
dbDisconnect(con)
## [1] TRUE
datatable(results)

Using Keywords to Identify Valuable Skills

Word Cloud

To see which keywords show up most frequenctly in job descriptions, we create a word cloud. From the word cloud, we see that some important skills for a data science are business, machine learning, teamwork, statistics, analytics, research, and modeling. Languages highlighted in the word cloud include Python, SQL, and Java. This word cloud only provides a high-level summary of skills - for a more in-depth analysis, we will search for specific keywords. Note - R is filtered out of this word cloud.

res <- rquery.wordcloud(paste(results$description), type="text", 
        lang="english", excludeWords = c("data", "experience","will","work"),
        textStemming = FALSE,  colorPalette="Paired",
        max.words=500)

Keyword Search for Languages

Below we search for occurance rates of specific languages. The majority of job postings mention R, Python, or SQL. Java and Scala are less commonly listed, but also seem to be useful skills for many job listings.

#R
r_listings_df <- results %>% 
  filter(grepl(" R | R,", description, ignore.case=T) | grepl(" R | R,", summary, ignore.case=T))

#Python
python_listings_df <- results %>% 
  filter(grepl(" python | python,", description, ignore.case=T) | grepl(" python | python,", summary, ignore.case=T))

#Java
java_listings_df <- results %>% 
  filter(grepl(" java | java,", description, ignore.case=T) | grepl(" java | java,", summary, ignore.case=T))

#SQL
SQL_listings_df <- results %>% 
  filter(grepl(" SQL | SQL,", description, ignore.case=T) | grepl(" SQL | SQL,", summary, ignore.case=T))

#Scala
scala_listings_df <- results %>% 
  filter(grepl(" Scala | Scala,", description, ignore.case=T) | grepl(" Scala | Scala,", summary, ignore.case=T))

language_pct <- tibble(
  language = c("R", "Python", "Java", "SQL", "Scala"), 
  mention_pct = c(round(nrow(r_listings_df)/nrow(results),digits=2), 
                  round(nrow(python_listings_df)/nrow(results),digits=2), 
                  round(nrow(java_listings_df)/nrow(results),digits=2), 
                  round(nrow(SQL_listings_df)/nrow(results),digits=2), 
                  round(nrow(scala_listings_df)/nrow(results),digits=2)))
language_pct %>% ggplot(aes(x=language, y=mention_pct)) + 
  geom_bar(stat="identity", position=position_dodge(), fill="steelblue") + 
  labs(x = "Language", y = "Percent Mentioned")

Keyword Search for Other Tools

We also looked into the percent of job postings that mentioned tools such as Tableau, Spark, Hadoop, and SAS. These results show the prevalence of Big Data - given how often Spark and Hadoop are mentioned.

#Tableau
tableau_listings_df <- results %>% 
  filter(grepl(" Tableau | Tableau,", description, ignore.case=T) | grepl(" Tableau | Tableau,", summary, ignore.case=T))

#Spark
spark_listings_df <- results %>% 
  filter(grepl(" spark | spark,", description, ignore.case=T) | grepl(" spark | spark,", summary, ignore.case=T))

#Hadoop
hadoop_listings_df <- results %>% 
  filter(grepl(" hadoop | hadoop,", description, ignore.case=T) | grepl(" hadoop | hadoop,", summary, ignore.case=T))

#SAS
SAS_listings_df <- results %>% 
  filter(grepl(" sas | sas,", description, ignore.case=T) | grepl(" sas | sas,", summary, ignore.case=T))

tools_pct <- tibble(
  tool = c("Tableau", "Spark", "Hadoop", "SAS"), 
  mention_pct = c(round(nrow(tableau_listings_df)/nrow(results),digits=2), 
                  round(nrow(spark_listings_df)/nrow(results),digits=2), 
                  round(nrow(hadoop_listings_df)/nrow(results),digits=2), 
                  round(nrow(SAS_listings_df)/nrow(results),digits=2)))
tools_pct %>% ggplot(aes(x=tool, y=mention_pct)) + 
  geom_bar(stat="identity", position=position_dodge(), fill="maroon") + 
  labs(x = "Tool", y = "Percent Mentioned")